\section{A Case Study: VLIW Hardware/Software Design Space}
\secL{ee}
Although PS has been formally introduced and described as a general
multi-objective optimization algorithm, in this work we will
explicitly refer to the more restricted (but still very large)
embedded system design scenario,  since its properties
are the ones that suggested to the authors the concepts of Pareto-based
innovation and region scoring system.

In particular, we use a parametrized Very Long Instruction
Word (VLIW) architecture~\cite{kathail_tr00} as testbed. The philosophy
behind a VLIW system makes this a natural choice for several reasons:
\begin{itemize}
\item \emph{Multi-Objective trade-offs}: VLIW architectures allow designers
to balance power, energy and performances objectives, resulting
in extended Pareto sets which are an ideal ground for our testing
purposes focused on parameter-space representation
\item \emph{Software Compiler awareness}: code scheduling for the execution of
applications is statically obtained by the compiler which
performs code transformations ``being aware'' of the underlying hardware
configuration.  This tight hardware/software dependence makes the
VLIW scenario perfectly suitable for investigating hardly predictable
parameter interactions.
\end{itemize}

In this section, we will also briefly describe the parametrized platform used
as testbed for the experiments, the general evaluation flow along with
the high-level estimation models and applications used as benchmarks.

\subsection{Design Space: Hardware and Software Parameters}
\secL{designspace}
Conceptually, system parameters can be classified in three main categories:
\emph{processor}, \emph{memory sub-system} and \emph{compiler}. The
first two categories are directly related to the physical
implementation of the system, and thus parameters of those
classes are usually referred as ``hardware parameters''. On the other
side, all the parameters affecting the behavior of the compiler
are referred as ``software parameters'', being directly involved in
the process of generating the application executable running on the
underlying hardware.

Processor hardware parameters are directly related to the VLIW core
and include the size of the register files and the number of
functional units of each type. On the software side, a set of the most
impacting code compilation parameters has been selected.
Table~\ref{tab:param} shows the complete set of parameters included in
the design space and their possible values. The total size of the design
space, which grows with the product of parameter cardinalities, is
about $1.33 \times 10^{14}$.


As regards the class of benchmark being considered, a set of applications representative of some common
frequently running code kernels has been selected. Table~\ref{tab:bench} shows the set of applications
along with a brief description.
% IEEE table style
%\begin{table}
%	\centering
%	\caption{Benchmarks}
%	\label{tab:bench}
%	\begin{tabular}{ll}
%	\hline
%	\multicolumn{1}{c}{Benchmark} & \multicolumn{1}{c}{Application} \\
%	\hline
%	Alloca\_test & Memory array allocation test \\
%	Bmm & Matrices multiplication and elements sum \\
%	Fir\_int & Finite impulse response \\
%	Mm & Floating point matrices multiplication \\
%	Sha256 & Cryptocurrency header Hashing \\
%	Wave & Wavefront computation \\
%	\hline
%	\end{tabular}
%\end{table}


\begin{table}
	\tbl{Design Space\label{tab:param}}{
	\begin{tabular}{lll}
	\hline
	\multicolumn{1}{c}{Parameter} & \multicolumn{1}{c}{Values} & \multicolumn{1}{c}{Type}\\
	\hline
	General Purpose Registers & 16, 32, 64, 128 & HW \\
	Floating Point Registers & 16, 32, 64, 128 & HW \\
	Predicate Instructions registers & 32, 64, 128 & HW \\
	Branch Target Registers & 16, 32, 64 & HW \\
	Control Registers & 32, 64, 128 & HW \\
	\hline
	Integer Functional Units & 1, 2, 3, 4, 5, 6 & HW \\
	Floating Point Units & 1, 2, 3, 4, 5, 6 & HW\\
	Memory Units & 1, 2, 3, 4  & HW \\
	Branch Units & 1, 2, 3, 4  & HW \\
	\hline
	L1 Data Cache Size (KB) & 1, 2, 4, 8, 16, 32, 64, 128 & HW \\
	L1 Data Cache Blocksize & 32, 68, 128 & HW \\
	L1 Data Cache Associativity (KB) & 1, 2, 4 & HW \\
	L1 Instruction Cache Size (KB) & 1, 2, 4, 8, 16, 32, 64, 128 & HW \\
	L1 Instruction Cache Blocksize & 32, 68, 128 & HW \\
	L1 Instruction Cache Associativity (KB) & 1, 2, 4 & HW \\
	L2 Unified Cache Size (KB) & 32, 64, 128, 256, 512  & HW\\
	L2 Unified Cache Blocksize & 32, 68, 128 & HW \\
	L2 Unified Cache Associativity (KB) & 2, 4, 8 & HW \\
	\hline
        tcc\_region: Scope of action of the compiler & basic block, super block, hyper block & SW \\
        max\_unroll\_allowed: unroll iterations allowed & none, medium, high & SW \\
        regroup\_only: Avoids inlining & yes, no & SW \\
        do\_classic\_opti: Classical optimizations & yes, no & SW \\
        do\_prepass\_scalar\_scheduling: Performs a prepass schedule & yes, no & SW \\
        do\_postpass\_scalar\_scheduling: Performs a schedule after & yes, no & SW \\
	do\_modulo\_scheduling: Modular scheduling of loop code & yes, no & SW \\
        memvr\_profiled: Memory-dependencies profiling & yes, no & SW \\
	\hline
	\end{tabular}}
\end{table}

% ACM table style
\begin{table}
	\tbl{Benchmarks\label{tab:bench}}{
	\begin{tabular}{ll}
	\hline
	\multicolumn{1}{c}{Benchmark} & \multicolumn{1}{c}{Application} \\
	\hline
	Alloca\_test & Memory array allocation test \\
	Bmm & Matrices multiplication and elements sum \\
	Fir\_int & Finite impulse response \\
	Mm & Floating point matrices multiplication \\
	Sha256 & Cryptocurrency header Hashing \\
	Wave & Wavefront computation \\
	\hline
	\end{tabular}}
\end{table}
%------------------------------------------------------------------------------
\subsection{Configuration Evaluation Flow}

To evaluate and compare the performance indexes of different
architectures for a specific application, one needs to simulate the
architecture running the code of the application. To make
architectural exploration possible both the compiler and the simulator
have to be retargetable. The open source project Trimaran~\cite{trimaran_hp} provides these
tools and, together with the estimation framework \ee~\cite{palpatti_estimedia03}, has been
adopted as a tested and flexible platform for carrying out the
experiments.

We will now show a functional scheme highlighting the main blocks of
the \ee{} framework and the interface with the Trimaran tools. The
input for the whole evaluation flow is the source file of an application
benchmark and the configuration of the architecture
being evaluated.  With reference to Figure~\ref{fig:evaluation_flow}
this input is represented by the blocks \emph{App.c} and
\emph{Config}.
\begin{figure}
        \centering
        \includegraphics[width=0.65\textwidth]{pictures/evaluation_flow.eps}
        \caption{Block diagram of the framework.}
        \label{fig:evaluation_flow}
\end{figure}

The application (App.c) is first compiled by the Trimaran's compiler
front-end (IMPACT). At the end of the compilation flow,
Trimaran supplies a simulation library (Emulib) which makes it
possible to execute the VLIW code produced by Elcor, generating a file
(Stats) containing the execution statistics (e.g., instruction mix,
execution cycles, utilization of functional units, etc.). A cache
simulator, along with a bus simulator, is used to gather information
about the behavior of the memory hierarchy in terms of miss rate and
data/address traffic on the interconnection buses.

Together with the configuration of the system, the statistics produced
by simulation contain all the information needed to apply the area,
performance and power consumption estimation model implemented in the
\emph{Estimator} component. These are high-level abstraction models
that allow to quickly get an estimation while still providing a
discrete level of accuracy (for a detailed description of the models
see also~\cite{palpatti_estimedia03}).

Finally, the results obtained by these
models are the input for the \emph{Explorer} component. This component
executes an optimization algorithm, the aim of which is to modify the
parameters of the configuration so as to minimize the three cost
functions (area, execution time and energy/power consumption).

%\subsection{Estimation Models}
%
%The average power consumed by the processor was estimated using an
%adaptation of the Cai-Lim model~\cite{cai_micro99} to the VLIW
%processor. As regards the cache subsystem, a transition-based model
%was used, according to the equations described
%in~\cite{kamble_islped97}.  The main memory energy is based on the
%model in~\cite{shiue_dac99} and assumes a per main memory access
%energy of $4.95 \times 10^{-9} J$ based on the data for the Cypress
%CY7C1326-133 memory chip.  The contribution towards power consumption
%made by the interconnection system was calculated by counting the
%number of transitions on the bus lines and applying the formula
%$P_{bus} = 1/2 V_{dd}^2 \alpha f C_l$ where $V_{dd}$ is the supply
%voltage, $\alpha$ is the switching activity, $f$ is the clock
%frequency and $C_l$ is the capacity of a bus line. For the on-chip
%buses we also considered the coupling capacitances between bus lines,
%using the model in~\cite{lekatsas_dac01}. 

\subsection{Comparison Setup}
\secL{comparison}

The results presented in the next section compare PS with one widely
adopted Multi-objective Genetic Algorithm~\cite{knowles_techrep06}. To
the best of our knowledge, genetic approaches in general still
represent the best compromise between efficiency (i.e. time required
for exploration) and accuracy of the reported Pareto Sets.  Note that,
for the reasons described above, mono-objective approaches have been
explicitly excluded from this analysis.  We also exclude other
approaches (e.g.  dependency based~\cite{givargis_tvlsi02}) that make
use of some a-priori knowledge about the role and the semantics of each
parameter, since our aim is evaluate the capacity of focusing on
interesting parameter regions without any external judgment or help
based on heuristics.  In this work, when referring to
genetic approach (or GA), we consider the widely spread 
\emph{SPEA2}~\cite{zitzler_eurogen01} variant of the Multi-objective
Genetic Algorithm, remanding to
works like~\cite{zitzler_ec00}~\cite{zitzler_tec03} for a detailed
analysis on how it could differ from other genetic approaches and what
could be expected from similar multi-objective strategies. 

Although an analogy between the concepts of ``era'' and ``generation''
can be considered, comparisons have been carried out using as
reference a fixed amount of total unique simulations, instead of a given
number of eras/generations . The purpose is to make the comparison
fair and accurate from the algorithm performance perspective, since a fixed budget of
different simulations is more directly related to the actual amount of time
required to execute the exploration.
%; on the other side, simply
%referring to a given number of generations/eras could be inaccurate
%when a considerable number of configurations remain unaltered in the
%current population (e.g. because of a local optimum)

The set of default algorithm tuning parameters used for both the
genetic and PS approaches is shown in
Table~\ref{tab:defaults}. Given the similarity, both in terms of
meaning and functionality, between the parameter $K$ and the population size
of the genetic approach, they both have been set to the same values. As discussed
when introducing the main PS algorithm concepts, a sensitivity analysis on the algorithm parameters is beyond the scope of this work. Nevertheless, we conducted several tests with
different values of $K$, similar to the ones shown in
Figure~\ref{fig:kimpact}, which refers to the alloca\_test application
and a total budget of 500 configurations. We found that initially increasing values
of $K$ seems to improve both Pareto front (as in the picture on the left) and objective
distribution (on the right), but, after some threshold value, $K$ does
not seem to provide any benefit, even returning to slightly worse
results for the last values of $K$ considered (100). Similar trends have been
found for the other benchmarks (not included for space reasons),
showing on average a value of $K$ around 30 as
reasonable sweet spot for the objectives. A complete resource with the
whole set of results and data obtained in the tests performed can be found online
at~\cite{ps_results}.

\begin{table}
	\tbl{Algorithm Parameters\label{tab:defaults}}{
	\begin{tabular}{cll}
	\hline
	\multicolumn{1}{c}{Algorithm} & \multicolumn{1}{c}{Parameter} & \multicolumn{1}{c}{Default Value} \\
	\hline
	GA & population size & 30 \\
	GA & crossover probability & 0.8 \\
	GA & mutation probability  & 0.1 \\
	PS & era budget ($K$) & 30 \\
	PS & innovation threshold ($\alpha$ ) & 1 \\
	both PS and GA & Total simulations budget & 100-500 \\
	\hline
	\end{tabular}}
\end{table}

\begin{figure}
  \figLC{kimpact}{Impact of increased K values on Pareto fronts and distribution for the alloca\_test application. }
  \begin{center}
    \includegraphics[width=0.45\textwidth]{pictures/alloca_K.eps} 
    \includegraphics[width=0.45\textwidth]{pictures/alloca_K_ALL.eps} 
  \end{center}
\end{figure}

%------------------------------------------------------------------------------

\subsection{Evaluation metrics}
\secL{metrics}
A qualitative evaluation of Paretos is probably the simplest way to
have a quick glance on algorithm behavior and results. However, in
tests involving a comparison between different exploration strategies,
this approach could not help in capturing the whole picture, missing
some important properties related to the distribution of the solutions
found.
In order to evaluate the Paretos from a quantitative point-of-view we
will consider two metrics, namely, the \emph{variation range} and the
\emph{average normalized absolute dispersion error}. Both of them permit to quantify the extent of the Pareto front that is an important factor to consider when evaluating its quality. Indeed, authors of \cite{zitzler_ec00,weise2012evolutionary} observe that an exploration algorithm should preferably provide Pareto fronts whose objective function range is large. In other words, following the terminology of~\cite{weise2012evolutionary}, a \emph{uniform convergence} should be guaranteed.
Pareto fronts with large objective function ranges permit to investigate in a broader way the potentials of the designed device. This is important, in general, since if only a narrow portion of the Pareto front is provided due to non-uniform convergence, the decision maker may neglect some important options~\cite{weise2012evolutionary}. Having a wide Pareto front becomes essential, in particular, when designing a device that may be required to satisfy different constraints. For example, the decision maker may want to design two different multi-purpose processors. The first design may adapt to situations in which the processor is used for real time elaboration (thus requiring low execution time) and is attached to the powerline (thus not having stringent power dissipation constraints). The second design, on the contrary, may fit situations in which the energy is provided by a battery (stringent power constraints) but only background computations are performed (long execution time is tolerable). These two processors may be successfully designed maintaining the same architecture and simply using different values for the design parameters. A wide Pareto front helps in achieving this goal.

For a given objective, the \emph{variation range} represents the ratio
between the maximum and the minimum value observed for that objective.

Another metric is the average \emph{normalized absolute dispersion
error}: it measures
the average absolute difference between the distribution of points in
the objective space and an ideal distribution in which the points are
uniformly distributed over the objective space. 
Formally, let $O$ be
the image, in the objective space, of the configurations visited by
the design space exploration. The generic element of $O$ (\ie, a
solution) is a pair $(p,t)$ where $p$ and $t$ are the average power
and execution time, respectively. The two-dimensional objective space
is then partitioned by a $M_x \times M_y$ mesh. For each tile $T_i,
\ i=1, 2, \ldots, M_xM_y$ of the mesh, let $N_i$ be the number
of points in $O$ which fall in $T_i$. The average absolute error, $E_i$, for
$T_i$ is the absolute value of the difference between $N_i$ and the
ideal number of solutions, $\overline{N}$, which should fall in $T_i$
in case of uniform distribution. Such $\overline{N}$ can be simply
computed as the ratio between the cardinality of $O$ and the number of
tiles. Thus,
\[ E_i = |N_i - \overline{N}|, \]
where $\overline{N} = |O| / (M_x M_y)$. The average
normalized absolute dispersion error ($ANADE$) is the average of $E_i$
normalized to the maximum absolute error $E_{max}$:
\[ ANADE = \frac{\sum_{i=1}^{M_xM_y} E_i/(M_xM_y)}{E_{max}}, \]
where $E_{max}$ can be computed as the average absolute error in the worst
case in which all the solutions fall in a single tile:
\[ E_{max} = \frac{(M_x M_y - 1) \overline{N} + |\overline{N} -
    |O|| }{M_x M_y}. \] 

%\subsection{Impact of the algorithm parameters}
%\secL{parameter_impact}
%\comment{IMPACT OF $K$. Si potrebbero presentare un plot in cui il numero totale di simulazioni e' fissato e vengono eseguiti diversi run dell'algoritmo con diversi valori di K. Per ogni run, il pareto set e' rappresentato nel plot, sottoforma di una curva diversa. Mi aspetterei che tutte le curve siano molto simili. Per ciascun run, il valore di variation range e average absolute dispersion error sono riportati. Mi aspetto valori simili, il che confermerebbe che la diversity nel nostro algoritmo non dipende da $K$. Ne vale la pena rifare questi run? Ci vuole troppo tempo?} 
%
%\comment{Si potrebbe mettere un plot con nelle ascisse l'era e nelle ordinate il numero di split avvenuti in quell'era. Si potrebbero disegnare diverse curve, una per ogni valore di alpha, per far vedere che il numero di split e' piu' alto q. Recommended values are close to 1, as results in \secR{ee} confirm. Using different values, if sufficiently close to 1, does not have a considerable impact on the resulting Pareto front approximation. Ne vale la pena rifare questi run? Ci vuole troppo tempo?}

\subsection{Results}

In this Section we first analyze the results of GA and PS approaches from a
qualitative perspective, showing how Pareto curves compare when
plotted on the bidimensional space of power and performance
objectives. Next, we perform a quantitative
analysis of the results in order to more accurately figure out in which
features/properties the two approaches differentiate.

\figR{pareto_fronts_100} and \figR{pareto_fronts_500} show the
resulting Pareto fronts for two scenarios of 100 and 500
simulation budget, respectively. The set of algorithm parameters involved and their
values are the ones previously shown in Table~\ref{tab:defaults}. It
should be pointed out again that the number of generations (or eras)
has been chosen so that only the given amount of different simulations is
actually performed. Thus, when accounting these simulations only
unique simulations have been considered, removing repetitions and
unfeasible configurations.

% IEEE table style
%\begin{table}
%  \centering
%  \begin{tabular}{ccc}
%    \includegraphics[width=0.30\textwidth]{pictures/alloca_100.eps} &
%    \includegraphics[width=0.30\textwidth]{pictures/bmm_100.eps} & 
%    \includegraphics[width=0.30\textwidth]{pictures/fir_int100.eps} \\
%    \includegraphics[width=0.30\textwidth]{pictures/mm_100.eps} &
%    \includegraphics[width=0.30\textwidth]{pictures/sha_100.eps} &
%    \includegraphics[width=0.30\textwidth]{pictures/wave_100.eps} 
%  \end{tabular}
%  \caption{Pareto fronts found by PS and GA for a fixed budget of 100 configurations.}
%  \label{fig:pareto_fronts_100}
%\end{table}
%
%\begin{table}
%  \centering
%  \begin{tabular}{ccc}
%    \includegraphics[width=0.30\textwidth]{pictures/alloca_500.eps} &
%    \includegraphics[width=0.30\textwidth]{pictures/bmm_500.eps} & 
%    \includegraphics[width=0.30\textwidth]{pictures/fir_int500.eps} \\
%    \includegraphics[width=0.30\textwidth]{pictures/mm_500.eps} &
%    \includegraphics[width=0.30\textwidth]{pictures/sha_500.eps} &
%    \includegraphics[width=0.30\textwidth]{pictures/wave_500.eps} 
%  \end{tabular}
%  \caption{Pareto fronts found by PS and GA for a fixed budget of 500 configurations.}
%  \label{fig:pareto_fronts_500}
%\end{table}


\begin{figure}
  \figLC{pareto_fronts_100}{Pareto fronts found by PS and GA for a
  fixed budget of 100 configurations.}
  \begin{center}
    \includegraphics[width=0.30\textwidth]{pictures/alloca_100.eps}
    \includegraphics[width=0.30\textwidth]{pictures/bmm_100_bis.eps}
    \includegraphics[width=0.30\textwidth]{pictures/fir_int100.eps}
    \includegraphics[width=0.30\textwidth]{pictures/mm_100.eps}
    \includegraphics[width=0.30\textwidth]{pictures/sha_100_bis.eps}
    \includegraphics[width=0.30\textwidth]{pictures/wave_100.eps} 
  \end{center}
\end{figure}

\begin{figure}
  \figLC{pareto_fronts_500}{Pareto fronts found by PS and GA for a
  fixed budget of 500 configurations.}
  \begin{center}
    \includegraphics[width=0.30\textwidth]{pictures/alloca_500.eps}
    \includegraphics[width=0.30\textwidth]{pictures/bmm_500_bis.eps}
    \includegraphics[width=0.30\textwidth]{pictures/fir_int500.eps}
    \includegraphics[width=0.30\textwidth]{pictures/mm_500.eps}
    \includegraphics[width=0.30\textwidth]{pictures/sha_500_bis.eps}
    \includegraphics[width=0.30\textwidth]{pictures/wave_500.eps} 
  \end{center}
\end{figure}

Low budget Pareto sets are shown in \figR{pareto_fronts_100}. Note
that with such a small number of configurations being investigated, the
two sets are often overlapping, being also strongly dependent upon the
application being considered.  These results are obtained in about 5
eras and demonstrate that, with such a limited budget, PS can compete with
genetic approach while even discovering very different Pareto fronts.
Considering the extended simulation budget as depicted in
\figR{pareto_fronts_500}, we can qualitatively observe that even if PS
does not strictly outperforms GA in terms of dominance, it shows
in some cases larger Pareto in terms of extension. We can intuitively
relate this to the ``novelty-based'' score system, so that instead of
focusing on getting better points inside a given range PS tries to
enlarge the range itself, resulting in less dense but more extended
Paretos. Of course we cannot classify this different behavior as
being strictly better or worse when compared to the genetic one: other
design factors should be evaluated on a case-by-case basis, e.g. the
desired granularity of the results, objective range constraints or
estimated error of the adopted models.

To better evaluate results from a different perspective, we
adopt the \emph{variation range} and the \emph{average absolute
dispersion error}, the two quantitative metrics described and motivated
in~\secR{metrics}. In particular, a comparison between the variation
ranges of both power dissipation and execution time, obtained by PS
and GA for different benchmarks, is shown in \figR{dispersion}(a) and~(b), respectively.  As it can be observed, the PS exploration provides
solutions which fall on a range that is, on average, wider than the
one provided by a GA exploration for power dissipation and execution
time, respectively.
%\begin{table}
%  \centering
%  \begin{tabular}{c}
%    \includegraphics[width=0.7\textwidth]{pictures/variation_power.eps} \\
%    (a) \\
%    \includegraphics[width=0.7\textwidth]{pictures/variation_etime.eps} \\
%    (b) \\
%    \includegraphics[width=0.7\textwidth]{pictures/dispersion.eps} \\
%    (c) 
%  \end{tabular}
%  \caption{(a)(b) Variation range obtained by PS and GA for different benchmarks
%  and average dispersion absolute dispersion (c)}
%  \label{fig:dispersion}
%\end{table}

\begin{figure}
  \figLC{dispersion}{(a)(b) Variation range obtained by PS and GA for
  different benchmarks and average absolute dispersion errors (c). }
  \begin{center}
    \subfigure[]{\includegraphics[width=0.7\textwidth]{pictures/variation_power.eps} }
    \subfigure[]{\includegraphics[width=0.7\textwidth]{pictures/variation_etime.eps} }
    \subfigure[]{\includegraphics[width=0.7\textwidth]{pictures/dispersion.eps} }
  \end{center}
\end{figure}

\figR{dispersion}(c) shows the average normalized absolute dispersion
errors for different benchmarks for PS and GA exploration.  As it can
be observed, GA shows a tendency in discovering more uniformly
dispersed points (i.e. higher values in the chart). These results,
together with the variation ranges of \figR{dispersion}(a) and~(b)
confirm the density vs. extension trade-offs when comparing PS and
GA approaches. We highlight that PS shows a stronger \emph{uniform convergence} (discussed in \secR{metrics}), i.e. it provides a more extended Pareto front, avoiding the designer to neglect some important options, and offering a wider view of the potentials of the designed device.

Finally, it should be pointed out that the final output of PS
execution is not only the Pareto-optimal set of configurations, but also
a partitioning of the parameter space in more or less innovative
regions, obtained as a result of the merge/splitting operators
described in~\secR{algorithm}. Exploiting the information carried out
by this partitioning could be an interesting added-value of PS to be
investigated in future developments of the proposed strategy.

%Nevertheless, the proposed algorithm introduces a new
%perspective on the allocation of a simulations budget, paving the way for
%the development of further strategies for addressing the problem of
%multi-objective hardware/software design space exploration.


